NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus
نویسندگان
چکیده
The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpora of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Other than increasing parallel data from diverse language pairs, annotating the corpus with various layers of information allows corpora linguists to discover linguistic phenomena and provides computational linguists with pre-annotated features for various NLP tasks. In addition to the agglomeration existing tools into a single python wrapper library, we have implemented three tools (Mini-segmenter, GaChalign and Indotag) that (i) provides users with varying analysis of the corpus, (ii) improves the state-of-art performance and (iii) reimplements a previously unavailable annotation tool as a free and open tool. This paper briefly describes the wrapper classes available in the toolkit and introduces and demonstrates the usage of the Mini-segmenter, GaChalign and Indotag.
منابع مشابه
Tan Liling and Francis Bond . Building and Annotating the Linguistically Diverse NTU - MC ( NTU – Multilingual Corpus )
The NTU-MC compilation taps on the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 375,000 words (15,000 sentences) in 6 languages (English, Chinese, Japanese, Korean, Indonesian and Vietnamese) from 6 language families (Indo-European, Sino-Tibetan, Japonic, Korean as a language isolate, Austronesian and Austro-Asiatic). The NTU-MC i...
متن کاملBuilding and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)
The NTU-MC compilation taps on the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 595,000 words (26,000 sentences) in 7 languages (Arabic, Chinese, English, Indonesian, Japanese, Korean and Vietnamese) from 7 language families (Afro-Asiatic, Sino-Tibetan, Indo-European, Austronesian, Japonic, Korean as a language isolate and Austro-...
متن کاملMEETING STRUCTURE ANNOTATION Annotations Collected with a General Purpose Toolkit
We describe a generic set of tools for representing, annotating, and analyzing multi-party discourse, including: an ontology of multimodal discourse, a programming interface for that ontology, and NOMOS – a flexible and extensible toolkit for browsing and annotating discourse. We describe applications built using the NOMOS framework to facilitate a real annotation task, as well as for visualizi...
متن کاملMeeting Structure Annotation: Data and Tools
We present a set of annotations of hierarchical topic segmentations and action item subdialogues collected over 65 meetings from the ICSI and ISL meeting corpora, designed to support automatic meeting understanding and analysis. We describe an architecture for representing, annotating, and analyzing multi-party discourse, including: an ontology of multimodal discourse, a programming interface f...
متن کاملThe MC-value for monotonic NTU-games
The MC-value is introduced as a new single-valued solution concept for monotonic NTU-games. The MC-value is based on marginal vectors, which aze extensions of the well-known marginal vectors for TU-games and hyperplane games. As a result of the definition it follows that the MC-value coincides with the Shapley value for TU-games and with the consistent Shapley value for hyperplane games. It is ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014